[parquet] Add row group index virtual column#9117
Conversation
073e3d5 to
0c9e12d
Compare
|
@Jefffrey if you have some bandwidth, I'd be curious to get your thoughts |
|
cc @alamb |
I'll see if I can take a look at this PR soon, though I will say it's been a while since I looked at the parquet codebase 😅 |
|
This would be sweet! I like the idea of using an extension type as a marker, and letting the caller customize the column name. |
| fn read_records(&mut self, batch_size: usize) -> Result<usize> { | ||
| let starting_len = self.buffered_indices.len(); | ||
| self.buffered_indices | ||
| .extend((&mut self.remaining_indices).take(batch_size)); |
There was a problem hiding this comment.
nit: I know this is how the row index reader did it, but since that code merged I learned that Iterator::by_ref is a thing.
| .extend((&mut self.remaining_indices).take(batch_size)); | |
| .extend(self.remaining_indices.by_ref().take(batch_size)); |
It's not shorter, but does seem more readable?
(more below)
| if metadata.is_some_and(str::is_empty) { | ||
| Ok("") | ||
| } else { | ||
| Err(ArrowError::InvalidArgumentError( | ||
| "Virtual column extension type expects an empty string as metadata".to_owned(), | ||
| )) | ||
| } |
There was a problem hiding this comment.
nit: is a match simpler?
| if metadata.is_some_and(str::is_empty) { | |
| Ok("") | |
| } else { | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) | |
| } | |
| match metadata { | |
| Some(&"") => Ok(""), | |
| _ => Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )), | |
| } |
or even
| if metadata.is_some_and(str::is_empty) { | |
| Ok("") | |
| } else { | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) | |
| } | |
| if let Some(&"") = metadata { | |
| return Ok(""); | |
| }; | |
| Err(ArrowError::InvalidArgumentError( | |
| "Virtual column extension type expects an empty string as metadata".to_owned(), | |
| )) |
Jefffrey
left a comment
There was a problem hiding this comment.
Looks reasonable to me 👍
0c9e12d to
0c019db
Compare
alamb
left a comment
There was a problem hiding this comment.
Thanks @friendlymatthew and @scovich and @Jefffrey -- this is really cool. I can't wait to get this integrated into downstream projects
|
|
||
| fn consume_batch(&mut self) -> Result<ArrayRef> { | ||
| Ok(Arc::new(Int64Array::from_iter( | ||
| self.buffered_indices.drain(..), |
There was a problem hiding this comment.
Most of the time each ArrayReader will be instantiated for pages from exactly one row group we can probably make this significantly faster by optimizing the case when reading from only a single row group.
The other thing would be to pre-calculate the arrays (where the row group values don't change) and return the same ArrayRef until the RowGroup actually changes
However, there is no reason we need to do that now.
There was a problem hiding this comment.
Filed:
- perf: optimize
RowGroupIndexReaderfor single row group reads #9180 - perf: cache
ArrayRefinRowGroupIndexReaderwhen row group doesn't change #9181
I wonder if we should have the same optimization for RowNumberReader 🤔
|
|
||
| fn skip_records(&mut self, num_records: usize) -> Result<usize> { | ||
| // TODO: Use advance_by when it stabilizes to improve performance | ||
| Ok((self.remaining_indices.by_ref()).take(num_records).count()) |
0c019db to
4437dc6
Compare
There was a problem hiding this comment.
Thank you @friendlymatthew @Jefffrey and @scovich

Which issue does this PR close?
Rationale for this change
This PR adds support for storing row group indices as a virtual column, allowing users to determine which row group each row originated from
The usage pattern is quite simple, something like: